TinyM2Net: A Flexible System Algorithm Co-designed Multimodal Learning Framework for Tiny Devices

With the emergence of Artificial Intelligence (AI), new attention has been given to implementing AI algorithms on resource-constrained tiny devices to expand the application domain of IoT. Multimodal learning has recently become very popular in classification tasks due to its impressive performance on both image and audio event classification. This paper presents TinyM2Net - a flexible system-algorithm co-designed multimodal learning framework for resource-constrained tiny devices. The framework was designed to be evaluated on two different case studies: COVID-19 detection from multimodal audio recordings and battlefield object detection from multimodal images and audios. To compress the model for implementation on tiny devices, substantial network architecture optimization and mixed-precision quantization (mixed 8-bit and 4-bit) were performed. TinyM2Net shows that even a tiny multimodal learning model can achieve better classification performance than unimodal frameworks. The most compressed TinyM2Net achieves 88.4% COVID-19 detection accuracy (a 14.5% improvement over the unimodal base model) and 96.8% battlefield object detection accuracy (a 3.9% improvement over the unimodal base model). Finally, we test our TinyM2Net models on a Raspberry Pi 4 to see how they perform when deployed to a resource-constrained tiny device.


INTRODUCTION
Artificial Intelligence (AI) has a huge impact on our daily lives nowadays, bringing convenience and ease of use to the table. AI devices can now perform computationally intensive tasks and eliminate human error from the system to a large extent, making this convenience possible. Today we see AI techniques and devices being used in domains such as medical diagnosis, security and combat fields, robotics, vision analytics, knowledge reasoning, and navigation. To integrate AI into our day-to-day life, it is being implemented on resource-constrained mobile and edge platforms. With the exponential growth of resource-constrained microcontroller (MCU) and microprocessor (MPU) powered devices, a new generation of neural networks has emerged, one that is smaller in size and more concerned with model efficiency than model accuracy. These low-cost, low-energy MCUs and MPUs open up a whole new world of tiny machine learning (TinyML) possibilities. By running deep learning models on very tiny devices, we can perform data analytics directly near the sensor, greatly expanding the field of AI applications.
Modern IoT and wearable devices, such as activity trackers, environmental sensors, and image and audio sensors, can generate large volumes of data on a regular basis. Modern AI is increasingly reliant on data from numerous sources in order to produce more accurate findings. The human learning process is multimodal: we make decisions by processing different modalities of data. To mimic human-like behavior, AI algorithms should integrate multimodal data as well. Multimodal learning combines disparate, heterogeneous data from a variety of sensors and data sources into a single model. In contrast to standard unimodal learning systems, the modalities in a multimodal system convey complementary information about one another, which becomes apparent only when both are integrated into the learning process. Thus, learning-based systems that incorporate data from many modalities can generate more robust inference, or even novel insights that would be unachievable in a unimodal system. Multimodal learning has two key advantages. First, several sensors observing the same phenomenon can produce more robust predictions, as recognizing changes in it may require the presence of both modalities. Second, the integration of many sensors enables the capture of complementary data or trends that individual modalities may miss. However, increased model parameters and computations limit the adoption of multimodal learning for resource-constrained edge and TinyML applications.
Commodity MCUs and MPUs have very limited resources in terms of memory (SRAM) and storage (Flash). A typical MCU has less than 512 kB of SRAM, which is insufficient for deploying the majority of off-the-shelf deep learning networks. Even on more capable hardware such as the Raspberry Pi 4, configuring inference to run in the L2 cache (1 MB) can dramatically increase energy efficiency. These issues add to the difficulty of performing efficient multimodal learning inference with low peak memory consumption. In this paper, we address this challenge and implement multimodal learning on tiny hardware. We take advantage of state-of-the-art compression techniques and combine them with computationally relaxed layers to implement energy-efficient multimodal learning on tiny processing hardware. We propose a flexible system-algorithm co-designed framework, TinyM2Net, which is re-configurable in terms of input data modality and shape, number of layers, filter sizes, and other hyper-parameters to meet application requirements. We evaluate TinyM2Net on two different case studies: audio processing with multimodal audios and object detection with multimodal images and audios. TinyM2Net is then implemented on a commodity tiny MPU, the Raspberry Pi 4, to measure real-time performance on tiny hardware. The main contributions of this paper are as follows:
• Propose TinyM2Net, a novel flexible system-algorithm co-designed multimodal learning framework for resource-constrained devices. TinyM2Net can take multimodal inputs (images and audios) and be re-configured for application-specific requirements, allowing the system and algorithms to quickly integrate new sensor data customized to various types of scenarios.
• Perform network architecture optimization and mixed-precision quantization to decrease computation complexity and memory size for resource-constrained hardware implementation while maintaining accuracy.
• Evaluate the proposed TinyM2Net on two different case studies. Case study 1 is COVID-19 detection from multimodal cough and speech audio recordings; case study 2 is battlefield object detection using multimodal images and audios.
• Implement TinyM2Net on a commodity microprocessor unit, the Raspberry Pi 4. We measured inference time during operation and provide the corresponding power profiling, showing that TinyM2Net meets the requirements of a real-time implementable TinyML system.

RELATED WORKS
The authors in [17] presented a high-level overview of the optimization techniques for deep neural network (DNN) inference on TinyML devices. TinyML model optimization includes different algorithms for parameter search, sparsification, and quantization. Element-wise pruning [19] and structured pruning [4,13] remove unimportant weights and compress models for implementation on TinyML devices. Extreme low-precision quantization [3,14] and mixed-precision quantization [9,11,28,29,31] have been adopted by researchers to decrease the memory requirements of DNN models. MCUNet [15] and MicroNets [6] were proposed to deploy DNN models on microcontroller units (MCUs). Recently, multimodal learning has attracted researchers seeking to improve classification accuracy by fusing different modalities of data [1,7,16,21]. However, implementation of multimodal learning on resource-constrained tiny hardware remains very limited due to large model sizes. We present a novel multimodal learning framework, TinyM2Net, which is system-algorithm co-designed and applies different model compression techniques to compress large multimodal models for implementation on tiny devices.
TinyM2Net FRAMEWORK

Figure 1 shows the proposed TinyM2Net framework along with its detailed architecture. Based on the case studies we mention in section 5, TinyM2Net is able to integrate two different modalities of data and classify them. TinyM2Net is designed mainly around convolutional neural networks (CNNs), which we chose as our base model because CNNs have previously performed very well on image and audio classification. We evaluate TinyM2Net on two case studies, described in detail in section 5. Case study 1 detects COVID-19 signatures using two different modalities of audio recordings, cough sound and speech sound. Case study 2 detects battlefield objects using images and audios. When processing audio data, we divide the whole audio into short window frames, whose size is variable based on application requirements. The window frames are then converted into Mel-Frequency Cepstral Coefficient (MFCC) spectrograms. In the next step, the two data modalities, whether images or audios, are sent to the CNN layers for feature extraction. The number of CNN layers can be adjusted to suit application-specific requirements. Max-pooling layers reduce the size of the feature maps. Once the output is flattened, a number of fully connected layers produce the needed tiny feature map, isolating enough information through the connections between nodes. The outputs of the two parallel feature extraction paths for the two data modalities are then concatenated and processed through fully connected layers to produce the final label. The activation function for each layer is the rectified linear unit (ReLU); a softmax activation generates the probability distribution at the final layer.
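To make this data flow concrete, the following is a minimal sketch of the two-branch architecture in Keras. The input shapes, filter counts, and layer depths are illustrative placeholders rather than our exact configuration, since TinyM2Net re-configures these hyper-parameters per application:

```python
# A minimal sketch of the TinyM2Net two-branch architecture.
# Shapes, filter counts, and depths are illustrative placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

def feature_branch(input_shape, name):
    """One CNN feature extractor for a single data modality."""
    inp = layers.Input(shape=input_shape, name=name)
    x = layers.Conv2D(16, 3, activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)  # tiny feature map
    return inp, x

# Branch 1: e.g. MFCC spectrogram of cough audio (or an image).
in1, f1 = feature_branch((40, 87, 1), "modality_1")
# Branch 2: e.g. MFCC spectrogram of speech audio.
in2, f2 = feature_branch((40, 87, 1), "modality_2")

# Late fusion: concatenate per-modality features, then classify.
merged = layers.concatenate([f1, f2])
x = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(2, activation="softmax")(x)  # final label

model = Model([in1, in2], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```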

MODEL COMPRESSION FOR TINY DEVICES
Traditional CNN models are notorious for being bulky in terms of memory and computation requirements. To implement CNNs on low-powered embedded tiny devices, researchers have proposed various compression techniques that result in highly optimized CNN models. Our proposed TinyM2Net adopts different model compression techniques that optimize the network architecture and memory requirements. TinyM2Net adopts Depthwise Separable CNN (DS-CNN) layers to reduce the computation of traditional CNN layers. To reduce memory requirements, TinyM2Net adopts low-precision and mixed-precision (MP) quantization. We emphasize MP quantization because uniform low-precision quantization degrades model accuracy. Figure 3 presents the conventions and computations of traditional CNN and DS-CNN layers. In a traditional CNN layer, if the input is of size $D_f \times D_f \times M$ and there are $N$ filters of size $D_k \times D_k \times M$, then the output of this layer, without zero padding, is of size $D_o \times D_o \times N$. If the stride of the convolution is $S$, then $D_o$ is determined by the following equation:

$$D_o = \frac{D_f - D_k}{S} + 1$$

Network Architecture Optimization with DS-CNN
In this layer, the filter convolves over the input by performing element-wise multiplication and summing all the values. An important note is that the depth of the filter is always the same as the depth of the input given to this layer. The computational cost of a traditional convolution layer is $N \times D_k^2 \times M \times D_o^2$ multiplications. Depthwise separable convolution is a combination of depthwise and pointwise convolution [12]. In contrast to traditional CNNs, which apply convolution to all channels at once, depthwise operations apply convolution to a single channel at a time, so the filters/kernels are of size $D_k \times D_k \times 1$. As there are $M$ channels at the input, $M$ such filters are needed, producing an output of size $D_o \times D_o \times M$. A single convolution operation requires $D_k \times D_k$ multiplications, and the filter slides $D_o \times D_o$ times across each channel, so the total computation for one depthwise convolution comes to $M \times D_o^2 \times D_k^2$. In the pointwise operation, a $1 \times 1$ convolution is applied across the $M$ channels, so the filter shape is $1 \times 1 \times M$. If we use $N$ such filters, the output shape becomes $D_o \times D_o \times N$. One such convolution operation needs $1 \times 1 \times M$ multiplications, so the total number of operations for one pointwise convolution is $M \times D_o^2 \times N$. Therefore, the total computational cost of one depthwise separable convolution is $M \times D_o^2 \times D_k^2 + M \times D_o^2 \times N$ [10]. Table 1 shows the equations for calculating the number of parameters and the number of computations for traditional CNN and DS-CNN layers. Here $D_k \times D_k$ is the size of the filter, $D_o \times D_o$ is the size of the output, $M$ is the number of input channels, and $N$ is the number of output channels.
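As a quick numerical check of these formulas, the following sketch compares the multiplication counts of a traditional convolution and its depthwise separable counterpart; the layer dimensions are placeholders, not those of any specific TinyM2Net layer:

```python
# Worked check of the cost equations above, with placeholder dimensions.

def conv_cost(Dk, Do, M, N):
    """Multiplications in one traditional convolution: N*Dk^2*M*Do^2."""
    return N * Dk**2 * M * Do**2

def ds_conv_cost(Dk, Do, M, N):
    """Depthwise (M*Do^2*Dk^2) plus pointwise (M*Do^2*N) multiplications."""
    return M * Do**2 * Dk**2 + M * Do**2 * N

Df, Dk, M, N, S = 32, 3, 16, 32, 1   # input size, filter size, channels, stride
Do = (Df - Dk) // S + 1              # output size without zero padding

std, ds = conv_cost(Dk, Do, M, N), ds_conv_cost(Dk, Do, M, N)
print(f"traditional: {std:,}  DS-CNN: {ds:,}  reduction: {std / ds:.1f}x")
# -> traditional: 4,147,200  DS-CNN: 590,400  reduction: 7.0x
```

The reduction factor works out to $N D_k^2 / (D_k^2 + N)$, which approaches $D_k^2$ (about 9x for $3 \times 3$ filters) as $N$ grows, which is why DS-CNN layers dominate our architecture.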

Model Quantization
To reduce memory requirements, model quantization is now attracting researchers designing TinyML models. The accuracy of a model can be significantly degraded if it is uniformly quantized to low bit precision. Mixed-precision quantization, in which each layer is quantized with a different bit precision, can address this. The main idea behind mixed-precision quantization is to keep sensitive layers at higher precision and insensitive layers at lower precision. However, the search space for these bit settings grows exponentially with the number of layers, which is challenging. Various methods have been offered to deal with this enormous search space. Reinforcement learning (RL) and Neural Architecture Search (NAS) have recently been presented as approaches for exploring it; however, these methods [9,28,29] often require a huge amount of computational resources and their performance is highly dependent on hyperparameters and even initialization. Integer Linear Programming (ILP) is employed in [11,31]; ILP is very lightweight and produces results within seconds. We adopted ILP and formulated our problem following the methodology described in [31], simplifying some constraints to obtain the mixed-precision settings for our TinyM2Net. The ILP equations were solved using the Python module PuLP.
To tackle the accuracy degradation of extreme low-bit-precision quantization, we chose two different bit-precision options ($B = 2$), INT4 and INT8, for our TinyM2Net framework. As TinyM2Net is flexible in its number of layers, for a model with $L$ layers the ILP search space becomes $B^L = 2^L$. The ILP finds the best bit-precision choice from this search space to obtain the optimal trade-off between model perturbation $\Omega$ and user-specified constraints, i.e., model size and bit operations (BOPS). Each bit-precision option has the potential to produce a different model perturbation. We assume the perturbations of the layers are independent of each other [31] (i.e., $\Omega = \sum_{i=1}^{L} \Omega_i^{(b_i)}$, where $\Omega_i^{(b_i)}$ is the perturbation of the $i$-th layer quantized to $b_i$ bits). This enables us to pre-compute the sensitivity of each layer independently, with only $B \times L$ computations required. The Hessian-based perturbation presented in [31] is used as the sensitivity metric; minimizing this sensitivity, the ILP finds the right bit-precision settings. The ILP formulation is as follows:

Objective:
$$\min_{\{b_i\}_{i=1}^{L}} \sum_{i=1}^{L} \Omega_i^{(b_i)}$$

Subject to:
$$\sum_{i=1}^{L} M_i^{(b_i)} \le \text{Model Size Limit}, \qquad \sum_{i=1}^{L} G_i^{(b_i)} \le \text{BOPS Limit}$$

Here, $M_i^{(b_i)}$ denotes the size of the $i$-th layer with $b_i$-bit quantization and $G_i^{(b_i)}$ is the corresponding BOPS required for computing that layer. All the equations are adopted from [31]. We use the same bit precision for both weights and activations so that the mathematical operations become efficient. The overall MP quantization process is summarized in Figure 2.
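A simplified sketch of this ILP in PuLP is shown below. The per-layer sensitivities, sizes, BOPS, and constraint limits here are placeholder values for illustration only; in practice the sensitivities come from the Hessian-based metric of [31]:

```python
# Simplified ILP bit assignment with PuLP; all numbers are placeholders.
import pulp

L, bits = 4, [4, 8]                          # layers, candidate precisions
omega = {(i, b): (8 - b) * (i + 1) * 0.1     # placeholder sensitivities
         for i in range(L) for b in bits}
size  = {(i, b): b * 1000 for i in range(L) for b in bits}   # bytes
bops  = {(i, b): b * b * 500 for i in range(L) for b in bits}
SIZE_LIMIT, BOPS_LIMIT = 26000, 150000       # user-specified constraints

prob = pulp.LpProblem("mp_quant", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (range(L), bits), cat="Binary")

# Objective: total model perturbation (sum of per-layer sensitivities).
prob += pulp.lpSum(omega[i, b] * x[i][b] for i in range(L) for b in bits)
# Exactly one bit precision per layer.
for i in range(L):
    prob += pulp.lpSum(x[i][b] for b in bits) == 1
# Model size and BOPS constraints.
prob += pulp.lpSum(size[i, b] * x[i][b] for i in range(L) for b in bits) <= SIZE_LIMIT
prob += pulp.lpSum(bops[i, b] * x[i][b] for i in range(L) for b in bits) <= BOPS_LIMIT

prob.solve()
chosen = {i: b for i in range(L) for b in bits if x[i][b].value() == 1}
print("bit assignment per layer:", chosen)
```

With these placeholder numbers the solver keeps the two most sensitive layers at INT8 and drops the rest to INT4, exactly the higher-precision-where-it-matters behavior described above.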

Case Study 1: COVID-19 Detection from Multimodal Audio Recordings
Combining numerous data sources has always been a high-priority topic, but with the advent of new AI-based learning algorithms, it has become critical to combine the complementary capabilities of distinct data sources for effective diagnosis, treatment, prognosis, and planning in a variety of medical applications. With the onset of the COVID-19 pandemic, patient pre-screening from passively recorded audio has become an active area of research, and a number of unimodal and multimodal COVID-19 audio datasets have been presented [5,8,20,24]. The ultimate goal of this research is to enable COVID-19 pre-screening on mobile or tiny devices. Figure 4 shows the high-level overview of the evaluation of TinyM2Net for COVID-19 detection. We evaluate TinyM2Net on the dataset presented in [23], which is a subset of the bigger dataset [8] collected by the University of Cambridge. This dataset contains 929 cough audios from 397 participants and 893 speech recordings from 366 participants. Each recording included a COVID-19 test result that was self-reported by the participant. To build the two-class classification task, the original COVID-19 test results were mapped to positive (designated as 'P') or negative (designated as 'N') categories.

Experimental Setups, Results and Analysis
To create a balanced multimodal dataset, we took 893 cough recordings from 366 participants, matching their IDs in the metadata so that both the cough and speech recordings come from the same person. We then divided the audio into 2-second chunks and produced 6,000 random samples from them. These were converted into MFCC spectrograms and passed to TinyM2Net, which processes the two modalities of audio with its parallel CNN layers, extracts features, combines them, and performs the final binary classification. We used a traditional CNN for the first layer and DS-CNN for the later layers; the detailed network architecture is given in Table 2. We trained the model with categorical cross-entropy loss and the Adam optimizer, achieving 90.4% classification accuracy at FP32 precision. We then quantized the model to uniform 8-bit and 4-bit precision; with MP quantization, the most compressed TinyM2Net achieves 88.4% COVID-19 detection accuracy, a 14.5% improvement over the unimodal base model.
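As an illustration of the chunking and MFCC conversion described above, the following is a minimal sketch assuming the librosa library; the sample rate, chunk length, number of coefficients, and file names are illustrative choices, not the exact settings of our pipeline:

```python
# Sketch of the audio pre-processing: split each recording into
# 2-second chunks and convert each chunk to an MFCC spectrogram.
import numpy as np
import librosa

def audio_to_mfcc_chunks(path, chunk_sec=2.0, sr=22050, n_mfcc=40):
    y, _ = librosa.load(path, sr=sr)
    n = int(chunk_sec * sr)                  # samples per chunk
    chunks = []
    for start in range(0, len(y) - n + 1, n):
        mfcc = librosa.feature.mfcc(y=y[start:start + n], sr=sr, n_mfcc=n_mfcc)
        chunks.append(mfcc[..., np.newaxis])  # add channel axis for the CNN
    return np.stack(chunks)                   # (num_chunks, 40, ~87, 1)

# Hypothetical file names; each modality is processed the same way and
# the two arrays feed the two input branches of the model.
# cough_x  = audio_to_mfcc_chunks("cough_0001.wav")
# speech_x = audio_to_mfcc_chunks("speech_0001.wav")
# model.fit([cough_x, speech_x], labels, epochs=..., batch_size=...)
```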

Case Study 2: Battlefield Component Detection from Multimodal Images and Audios
Research in the field of computer vision has always focused on the detection of specific targets in an image. Research on the identification of armored vehicles in the battlefield environment, as well as deployment dynamics, recognition and tracking, precise strikes, and so on, addresses critical military objectives. There are still many difficulties in detecting armored vehicles on the battlefield because of the complexity of the environment [30,32]. The authors in [18,25] proposed multimodal learning approaches to the object detection task.
We present a novel multimodal learning approach to battlefield object detection based on image and audio modalities. Figure 5 shows the high-level overview of the evaluation of TinyM2Net for battlefield object detection. We trained our model with categorical cross-entropy loss and the Adam optimizer, achieving 98.5% classification accuracy at FP32 precision. We then quantized the model to uniform 8-bit and 4-bit precision and achieved 97.9% and 88.7% classification accuracy, respectively. Our MP quantization technique improves the classification accuracy to 97.5%, which is comparable to both the 8-bit quantized and FP32 models. We achieved 93.6% classification accuracy with the unimodal (image-only) implementation; our multimodal approach improved the object detection accuracy by 3.9%. We calculated inference time with a batch size of 1, i.e., the time it takes to process a single data point. We also need to consider the model's power profile when deploying it in the real world: the running power of any deep model should be well within the device's sustainable range. Power consumption is calculated by subtracting the idle power from the peak power observed during inference. We report results in milliwatts (mW), measured with a USB power meter. The inference time and power required during inference with the most compressed models are given in Table 5 for both case studies.
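The following is a minimal sketch of how such a batch-size-1 latency measurement can be scripted on the Raspberry Pi 4, assuming the compressed model has been converted to TFLite; the model file name is a placeholder:

```python
# Sketch of the batch-size-1 latency measurement on the Raspberry Pi 4.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="tinym2net_mp.tflite")
interpreter.allocate_tensors()

# Dummy inputs (one per modality); content does not matter for timing.
for d in interpreter.get_input_details():
    interpreter.set_tensor(d["index"], np.zeros(d["shape"], dtype=d["dtype"]))

runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.invoke()
latency_ms = (time.perf_counter() - start) / runs * 1e3
print(f"mean single-sample inference latency: {latency_ms:.2f} ms")

# Power is measured externally with a USB power meter: record idle power,
# record peak power while the loop above runs, and report the difference (mW).
```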

CONCLUSION
This paper presents TinyM2Net, a flexible system-algorithm co-designed multimodal learning framework which exploits as much of the correlated information a multimodal dataset provides as possible, evaluated on two important TinyML case studies: detecting the signature of COVID-19 in participants' cough and speech sounds, and battlefield object detection from multimodal images and audios. For implementation on tiny hardware, extensive model compression was performed in terms of network architecture optimization and MP quantization (mixed 8-bit and 4-bit). The most compressed TinyM2Net achieves 88.4% COVID-19 detection accuracy and 96.8% battlefield object detection accuracy. Finally, we test our TinyM2Net models on a Raspberry Pi 4 to see how they perform when deployed to a resource-constrained tiny device.

ACKNOWLEDGMENT
We acknowledge the support of the U.S. Army Grant No. W911NF21-20076.